arria 10
Fast Inner-Product Algorithms and Architectures for Deep Neural Network Accelerators
Pogue, Trevor E., Nicolici, Nicola
We introduce a new algorithm called the Free-pipeline Fast Inner Product (FFIP) and its hardware architecture that improve an under-explored fast inner-product algorithm (FIP) proposed by Winograd in 1968. Unlike the unrelated Winograd minimal filtering algorithms for convolutional layers, FIP is applicable to all machine learning (ML) model layers that can mainly decompose to matrix multiplication, including fully-connected, convolutional, recurrent, and attention/transformer layers. We implement FIP for the first time in an ML accelerator then present our FFIP algorithm and generalized architecture which inherently improve FIP's clock frequency and, as a consequence, throughput for a similar hardware cost. Finally, we contribute ML-specific optimizations for the FIP and FFIP algorithms and architectures. We show that FFIP can be seamlessly incorporated into traditional fixed-point systolic array ML accelerators to achieve the same throughput with half the number of multiply-accumulate (MAC) units, or it can double the maximum systolic array size that can fit onto devices with a fixed hardware budget. Our FFIP implementation for non-sparse ML models with 8 to 16-bit fixed-point inputs achieves higher throughput and compute efficiency than the best-in-class prior solutions on the same type of compute platform.
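As a rough illustration of the 1968 fast inner-product identity the abstract builds on, the sketch below computes a matrix product using about half the multiplications of the direct method when the inner dimension is even. It relies on the identity (a1 + b2)(a2 + b1) - a1*a2 - b1*b2 = a1*b1 + a2*b2, where the correction terms a1*a2 and b1*b2 depend on only one operand and can be computed once per row or column. The function name fip_matmul and the NumPy formulation are assumptions made for illustration; this is not the authors' FFIP algorithm or their hardware architecture.

```python
import numpy as np

def fip_matmul(A, B):
    """Matrix product via Winograd's fast inner product (hypothetical helper).

    Requires an even inner dimension. Uses roughly m*n*k/2 + (m + n)*k/2
    multiplications instead of the m*n*k of the direct method.
    """
    A = np.asarray(A)
    B = np.asarray(B)
    m, k = A.shape
    k2, n = B.shape
    if k != k2 or k % 2 != 0:
        raise ValueError("inner dimensions must match and be even")

    # Split each operand into its 1st, 3rd, ... and 2nd, 4th, ... elements.
    a_odd, a_even = A[:, 0::2], A[:, 1::2]   # shapes (m, k/2)
    b_odd, b_even = B[0::2, :], B[1::2, :]   # shapes (k/2, n)

    # Correction terms: computed once per row of A and once per column of B.
    alpha = (a_odd * a_even).sum(axis=1)     # shape (m,)
    beta = (b_odd * b_even).sum(axis=0)      # shape (n,)

    # Main term: one multiplication per pair of inner-product elements.
    left = a_odd[:, :, None] + b_even[None, :, :]    # shape (m, k/2, n)
    right = a_even[:, :, None] + b_odd[None, :, :]   # shape (m, k/2, n)
    main = (left * right).sum(axis=1)                # shape (m, n)

    return main - alpha[:, None] - beta[None, :]

# Usage: compare against the direct product on small 8-bit-range integers.
rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(4, 6))
B = rng.integers(-128, 128, size=(6, 5))
C = fip_matmul(A, B)
```

The pre-additions widen each multiplier input by one bit, which is one reason the technique pairs naturally with fixed-point arithmetic rather than floating point.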
Systolic-CNN: An OpenCL-defined Scalable Run-time-flexible FPGA Accelerator Architecture for Accelerating Convolutional Neural Network Inference in Cloud/Edge Computing
Dua, Akshay, Li, Yixing, Ren, Fengbo
This paper presents Systolic-CNN, an OpenCL-defined scalable, run-time-flexible FPGA accelerator architecture, optimized for accelerating the inference of various convolutional neural networks (CNNs) in multi-tenancy cloud/edge computing. The existing OpenCL-defined FPGA accelerators for CNN inference are insufficient due to limited flexibility for supporting multiple CNN models at run time and poor scalability resulting in underutilized FPGA resources and limited computational parallelism. Systolic-CNN adopts a highly pipelined and parallelized 1-D systolic array architecture, which efficiently exploits both spatial and temporal parallelism for accelerating CNN inference on FPGAs. Systolic-CNN is highly scalable and parameterized, and can be easily adapted by users to achieve up to 100% utilization of the coarse-grained computation resources (i.e., DSP blocks) for a given FPGA. Systolic-CNN is also run-time-flexible in the context of multi-tenancy cloud/edge computing: it can be time-shared to accelerate a variety of CNN models at run time without recompiling the FPGA kernel hardware or reprogramming the FPGA. The experiment results based on an Intel Arria/Stratix 10 GX FPGA Development board show that the optimized single-precision implementation of Systolic-CNN can achieve an average inference latency of 7ms/2ms, 84ms/33ms, 202ms/73ms, 1615ms/873ms, and 900ms/498ms per image for accelerating AlexNet, ResNet-50, ResNet-152, RetinaNet, and Light-weight RetinaNet, respectively. Codes are available at https://github.com/PSCLab-ASU/Systolic-CNN.
Deep learning processing unit delivers 135 GOPS/W on midrange FPGAs
The Omnitek deep learning processing unit (DPU) employs a novel mathematical framework combining low-precision fixed point maths with floating point maths to achieve 135 GOPS/W at full 32-bit floating point accuracy when running the VGG-16 CNN in an Arria 10 GX 1150. Scalable across a wide range of Arria 10 GX and Stratix 10 GX devices, the DPU can be tuned for low cost or high performance in either embedded or data centre applications. The DPU is fully software programmable in C/C++ or Python using standard frameworks such as TensorFlow, enabling it to be configured for a wide range of standard CNN models including GoogLeNet, ResNet-50 and VGG-16 as well as custom models. No FPGA design expertise is required to do this. "We are very excited to apply this unique innovation, resulting from our joint research program with Oxford University, to reducing the cost of a whole slew of AI-enabled applications, particularly in video and imaging where we have a rich library of highly optimised IP to complement the DPU and create complete systems on a chip", commented Roger Fawcett, CEO at Omnitek.
Intel FPGAs Break Record for Deep Learning Facial Recognition - insideHPC
Today Intel announced record results on a new benchmark in deep learning and convolutional neural networks (CNN). Developed with ZTE, a leading telecommunications equipment and systems company, the image recognition technology is what many companies in Internet search and AI are trying to advance. "Perception, such as recognizing a face in an image, is one of the essential goals of the ZTE 5G System," said Duan Xiangyang, vice president of the ZTE Wireless Institute. "Deep learning technology is very important as it can enable such perception in mobile edge computing systems, thus making ZTE's 5G System smarter." The test took place in Nanjing City, China, where ZTE's engineers used Intel's midrange Arria 10 FPGA for a cloud inferencing application using a CNN algorithm. ZTE has achieved a new record – beyond a thousand images per second in facial recognition – with what is known as "theoretical high accuracy" achieved for their custom topology. Intel's Arria 10 FPGA accelerated the raw design performance more than 10 times while maintaining the accuracy. The Arria 10 FPGA provides up to 1.5 teraflops (TFLOPs) single precision floating-point processing performance, 1.15 million logic elements, and more than a terabit-per-second high-speed connectivity. Such deep learning designs can be seamlessly migrated from the Arria 10 FPGA family to the high-end Intel Stratix 10 FPGA family, and users can expect up to a ninefold performance boost. Besides the impressive increase in performance, the team at the ZTE Wireless Institute sped design time with the use of the OpenCL programming language. "With the Intel reference design, and using the Intel SDK for OpenCL to program the FPGA, our development time was greatly shortened," said Xiong Tiankui, chief engineer, ZTE Wireless Institute.
FPGA-Based AI System Recognizes Faces at 1,000 Images per Second - EE Times
There is tremendous potential for facial recognition technology, such as informing visually impaired persons if someone they know is approaching them. I find it difficult to believe just how fast things are moving with regard to using artificial neural networks (ANNs) and deep learning techniques (for example, see Deep learning machine vision system aids blind and visually impaired, Deep learning hits a sweet note, Machine learning platform speeds optimization of vision systems, Unlocking the power of AI for all developers, and Push-button generation of deep neural networks). Of course, one really interesting application is to perform object detection and identification, including the really tricky task of recognizing and identifying faces in images and videos. This sort of task benefits from the extreme parallelism offered by FPGAs. Of particular interest are Intel's current generation of FPGAs, whose hard-core DSP slices offer both fixed-point and floating-point capabilities, making them suitable for a wide range of artificial intelligence (AI) and embedded vision applications.
Breakthrough for deep learning with Intel FPGAs
Intel and Chinese telecoms company ZTE claim to have achieved a new record – more than 1000 images per second in facial recognition – with what is known as 'theoretical high accuracy' achieved for their custom topology. "Perception, such as recognising a face in an image, is one of the essential goals of the ZTE 5G system," said Duan Xiangyang, vice president of the ZTE Wireless Institute. "Deep learning technology is important as it can enable such perception in mobile edge computing systems, thus making ZTE's 5G system smarter." The test took place in Nanjing, where ZTE's engineers used Intel's Arria 10 FPGA for a cloud inferencing application using a convolutional neural network (CNN) algorithm. According to the company, the deep learning designs can be migrated from the Arria 10 FPGA family to the Intel Stratix 10 FPGA family, and users can expect up to a ninefold performance boost.